ABSTRACT

For ambiguous queries, conventional retrieval systems are 
bound by two conflicting goals. On the one hand, they 
should diversify and strive to present results for as many 
query intents as possible. On the other hand, they should 
provide depth for each intent by displaying more than a sin- 
gle result. Since both diversity and depth cannot be achieved 
simultaneously in the conventional static retrieval model, we 
propose a new dynamic ranking approach. Dynamic ranking 
models allow users to adapt the ranking through interaction, 
thus overcoming the constraints of presenting a one-size-fits- 
all static ranking. In particular, we propose a new two-level 
dynamic ranking model for presenting search results to the 
user. In this model, a user's interactions with the first-level 
ranking are used to infer this user's intent, so that second- 
level rankings can be inserted to provide more results rel- 
evant for this intent. Unlike for previous dynamic ranking 
models, we provide an algorithm to efficiently compute dy- 
namic rankings with provable approximation guarantees for 
a large family of performance measures. We also propose 
the first principled algorithm for learning dynamic ranking 
functions from training data. In addition to the theoret- 
ical results, we provide empirical evidence demonstrating 
the gains in retrieval quality that our method achieves over 
conventional approaches. 
1. INTRODUCTION

Search engine users often express different information 
needs using the same query. This leads to the well-known 
problem of ambiguous queries, where a single query can rep- 
resent multiple intents. In some cases, the ambiguity in 
intent can be coarse (e.g., queries such as apple, jaguar and
SVM). In other cases, the distinctions can be more fine-grained
(e.g., the query apple ipod with the intent of either
buying the device or reading reviews). 

Conventional retrieval methods do not explicitly model 
query ambiguity, but simply select a ranking of results by 
maximizing the probability of relevance independently for 
each document [17]. A major limitation of this approach is
that it favors results for the most prevalent intent. In the ex- 
treme, the retrieval system focuses entirely on the prevalent 
intent, but produces no relevant results for the less popular 
intents. Diversification-based methods (e.g., [4, 23, 5, 22])
try to alleviate this problem by including at least one rel- 
evant result for as many intents as possible. However, this 
necessarily leads to fewer relevant results for each intent. 
Clearly, there is an inherent trade-off between depth (num- 
ber of results provided for an intent) and diversity (number 
of intents served). In the conventional ranked-retrieval set- 
ting, choosing one invariably leads to the lack of the other. 
A natural question that arises in this context is: how can 
we obtain diversity while not compromising on depth? 

We argue that a key to solving the conflict between depth 
and diversity lies in the move from conventional static retrieval
models to dynamic retrieval models [3] that can take
advantage of user interactions. Instead of presenting a sin- 
gle one-size-fits-all ranking, dynamic retrieval models allow 
users to adapt the ranking dynamically through interaction. 
Brandt et al. [3] have already given theoretical and empirical
evidence that even limited interactivity can greatly improve
retrieval effectiveness. However, Brandt et al. [3] did not
provide an efficient algorithm for computing dynamic rank- 
ings with provable approximation guarantees, nor did they 
provide a principled algorithm for learning dynamic ranking 
functions from training data. In this paper, we resolve these 
two open questions. 

In particular, we propose a new two-level dynamic ranking
model. The intuition behind the model is that the first
level provides a diversified ranking of results. The system
then senses the user's interactions with the first level and
interactively provides second-level rankings conditioned on
this feedback. A possible layout for such two-level rankings
is given in Figure 1. The left-hand panel shows the first-level
ranking initially presented to the user. The user then
chooses to expand the second document (e.g., by mousing
over or clicking) and a second-level ranking is inserted as
shown in the center panel. Conceptually, the retrieval sys- 
tem maintains two levels of rankings as illustrated in the 
right-hand panel, where each second-level ranking is condi- 
tioned on the head document in the first-level ranking. 

Figure 1: A typical two-level ranking for the query "jaguar". A
user interested in the animal "jaguar" interacts with the
first-level ranking (left panel) and can expand results of
interest to see additional results (middle panel). A two-level
ranking can be thought of as a two-dimensional matrix (right
panel).

To operationalize the construction and learning of two-level
rankings in a rigorous way, we define a new family of
performance measures for diversified retrieval. Many existing
retrieval measures (e.g., Precision@k, DCG, Intent Coverage)
are special cases of this family. We then operationalize
the problem of computing an optimal two-level ranking as
maximizing the given performance measure. While this
optimization problem is NP-hard, we provide an algorithm that
we show has a $(1 - e^{-(1-1/e)})$ approximation guarantee.

Finally, we also propose a new method for learning the 
(mutually dependent) relevance scores needed for two-level
rankings. Following a structural SVM approach, we learn a
discriminant model that resembles the desired performance
measure in structure, but learns to approximate unknown
intents based on query and document features. This method
generalizes the learning method from [22] to two-level
rankings and a large class of loss functions. In addition to
theoretical results, we evaluate the properties of our model, the
algorithm for computing two-level rankings, and the learn- 
ing methods through a detailed empirical analysis. 



2. RELATED WORK 

Traditional non-diversified methods for retrieval focus on
ranking documents based on their probability of relevance to
the query [17, 12, 13]. However, these models are problematic
in the case of ambiguous queries, as they tend to favor
the most common user intent while ignoring the others. 

Diversified retrieval aims to overcome the challenge of
query ambiguity by providing diversity in search results
[4, 23, 5, 22, 6]. In the extreme case, diversified retrieval
methods maximize intent coverage, meaning that they aim at
covering as many intents in the ranking as possible by providing
just a single document per intent. The methods in [19, 16, 22]
formulate this problem as a set coverage problem. Most
concretely, [16] proposed a multi-armed bandit algorithm,
showing that it maximizes the number of users presented
with at least one relevant document with provable guarantees.
While diversification methods alleviate the problem
of ignoring less frequent intents, they all explicitly or
implicitly improve diversity at the expense of depth (i.e., they
present only a few relevant documents for any given intent).



Recent work by Brandt et al. [3] has focused on addressing
the above issue. They propose a dynamic ranked-retrieval
model that allows user interaction. Users can interactively
expand results so that a dynamic ranking gets created on
the fly. Instead of following a static list of results, users
construct their individual ranking as a path through a ranking
tree. Since users with different intents take different paths,
it is possible to tailor both the distribution and content of
each path. Brandt et al. [3] have shown that this small
amount of interactivity overcomes the inherent trade-off
between diversity and depth of conventional static rankings.
We also follow this idea of dynamic ranking, but with the
following differences. First, we focus on a simpler and more
plausible model of user behavior. Unlike in [3], we do not
assume that users are willing to provide feedback that is
more than one level deep (see Section 3), and we allow users
to backtrack to a higher level. Second, unlike the algorithms
for constructing dynamic rankings presented in [3], we present
an algorithm that has provable approximation guarantees.
Furthermore, our algorithm and model apply to a large class
of submodular performance measures, which include those
of [3] and [16, 22] as special cases. And finally, we propose a
principled method for learning dynamic ranking functions.

Our learning method follows a long line of research on
training retrieval functions [13]. However, with the
exception of [22], virtually no other work directly addresses the
problem of learning diversified retrieval functions. Following
an idea first presented in [19], Yue and Joachims [22] relate
diversity in word occurrences to diversity in intents. They
propose to learn this relationship using a structural SVM
method, where they formulate the discriminant function as
a coverage problem and optimize intent coverage as the loss
function. We employ a similar learning technique and
also draw a correlation between intents and words. However,
our learning method goes beyond single-level static ranking
to predict two-level dynamic rankings, and it can optimize a
large family of loss functions as defined in Section 4.2.

The dynamic retrieval model we present in this paper is a
special case of interactive retrieval. An interactive retrieval
setting involves multiple interactions between users and a
system. Our model is most closely related to relevance
feedback (e.g. [T| [Ts] [51] [24]), where the system presents a set
of results and the user provides either implicit or explicit
feedback. The feedback can then be used by the system
to update the ranking. Note that the interface sketched in
Figure 1 is inspired by SurfCanyon.com [t].

Table 1: Utility $U(d_j|t_i)$ of a document $d_j$ given an
intent $t_i$.

3. TWO-LEVEL DYNAMIC RANKINGS 

Current methods for diversified retrieval are static in na- 
ture. Such static rankings stay unchanged through a user 
session. On the other hand, a dynamic model can adapt the 
ranking based on interactions with the user. The primary 
motivation for using a dynamic model is the inherent trade-off
between depth and diversity in static models. Figure 1
illustrates the two-level dynamic rankings considered in this 
paper. We now provide a simple quantitative example to 
show how such two-level dynamic rankings can achieve both 
diversity and depth. 

Consider four user intents $\{t_1, \ldots, t_4\}$ and nine
documents $\{d_1, \ldots, d_9\}$ with relevance judgments
$U(d_j|t_i)$ as given in Table 1. In this example, we assume
that the user intents are equally likely. On the one hand, a
non-diversified static ranking method could present
$d_7 \to d_8 \to d_9$ as its top three documents. This means
that users with intents $t_3$ and $t_4$ get two relevant
documents, but it fails to provide any relevant documents to
users with intents $t_1$ and $t_2$. On the other hand, a
diversified static ranking with $d_7 \to d_1 \to d_4$ as
the top three documents covers all intents, but no user gets
more than one relevant document. Therefore, this ranking
lacks depth.

As an alternative, now consider a two-level dynamic ranking
as follows. The user is presented with $d_7 \to d_1 \to d_4$
as the first-level ranking. Users can now expand any of
the first-level results to receive a second-level ranking.
Assume that users interested in $d_7$ (and thus having intent
$t_3$ or $t_4$) will expand that result and receive a
second-level ranking consisting of $d_8$ and $d_9$. Similarly,
users interested in $d_1$ will get $d_2$ and $d_3$. And users
interested in $d_4$ will get $d_5$ and $d_6$. Note that every
intent is covered in the top three results of the first-level
ranking. At the same time, users with intents $t_3$ and $t_4$
receive two relevant results in the top three positions of their
dynamically constructed ranking
$d_7 \to d_8 \to d_9 \to d_1 \to d_4$; users with intent $t_1$
also receive two relevant results in the top three positions of
$d_7 \to d_1 \to d_2 \to d_3 \to d_4$; and users with intent
$t_2$ still receive one relevant result in the top three of
$d_7 \to d_1 \to d_4 \to d_5 \to d_6$. This example illustrates
how a dynamic two-level ranking can provide diversity and
increased depth simultaneously.

In the above example, interactive feedback from the user
was the key to achieving both depth and diversity. More
generally, we assume that users interact with the dynamic
ranking according to the following user model, which we
denote as policy $\pi_d$. While other policies of user behavior
are possible, we focus on $\pi_d$ for the sake of simplicity.
Users expand a first-level document if and only if that document
is relevant to their intent. When users skip a document, they
continue with the next first-level result. When users expand
a first-level result, they go through the second-level ranking
before continuing from where they left off in the first-level
ranking. It is thus possible for a user to see multiple
second-level rankings. Hence we do not allow documents to appear
more than once across all two-level rankings.
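
The user policy above can be illustrated with a small sketch. It is not part of the original method description: the function name `user_path` and the relevance sets are hypothetical, chosen only to be consistent with the running example (the actual values of Table 1 are not reproduced here).

```python
from typing import Dict, List, Sequence, Set


def user_path(rows: Sequence[Sequence[str]], relevant: Set[str]) -> List[str]:
    """Return the documents a user sees, in order, under the policy
    "expand a head document if and only if it is relevant".

    Each row is (head, tail_1, ..., tail_W). The user always reads the
    head; the tail is read only when the head is relevant to the intent.
    """
    seen: List[str] = []
    for row in rows:
        head, tail = row[0], row[1:]
        seen.append(head)
        if head in relevant:
            seen.extend(tail)
    return seen


# Two-level ranking from the running example: one row per head document.
ranking = [("d7", "d8", "d9"), ("d1", "d2", "d3"), ("d4", "d5", "d6")]

# Hypothetical relevance sets per intent (assumed for illustration only).
relevant_by_intent: Dict[str, Set[str]] = {
    "t1": {"d1", "d2"},
    "t2": {"d4"},
    "t3": {"d7", "d8"},
}

paths = {t: user_path(ranking, rel) for t, rel in relevant_by_intent.items()}
for intent, path in paths.items():
    print(intent, " -> ".join(path))
```

Under these assumed relevance sets, the first three positions of each constructed path match the dynamically constructed rankings described in the example.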

Note that this user model differs from the one proposed
in [3] in several ways. First, it assumes only one level of
feedback, while the model in [3] assumes that users are willing
and able to provide feedback many levels of rankings deep.
Second, we model that users return to the top-level ranking,
which is not allowed in the model of [3]. We believe that
these differences make the two-level model easier to under- 
stand for the user and therefore more plausible for practical 
use. 

We now define some notation used in the rest of this paper.
The documents shown on the first level are called the
head documents. The number of documents shown in this
level is the length of the first level, and it is denoted by $L$.
The documents shown on the second level are called
the tail documents. A row denotes a particular head document
and all the tail documents that follow it in the second
level. The length of a row, excluding the head document, is
denoted by $W$ and is referred to as the width. Static rankings
are denoted as $\theta$ while two-level rankings are denoted as
$\Theta = (\Theta_1, \Theta_2, \ldots, \Theta_i, \ldots)$. Here
$\Theta_i = (d_{i0}, d_{i1}, \ldots, d_{ij}, \ldots)$ refers
to the $i$-th row of a two-level ranking, with $d_{i0}$
representing the head document of the row and $d_{ij}$ denoting
the $j$-th tail document of the second-level ranking. We denote
the candidate set of documents to rank for a query $q$ by
$\mathcal{D}(q)$, the set of possible intents by
$\mathcal{T}(q)$, and $P[t|q]\ \forall t \in \mathcal{T}(q)$
denotes a distribution over the intents given a query $q$. Unless
otherwise mentioned, dynamic ranking refers to a two-level
dynamic ranking in the rest of this paper.

4. PERFORMANCE MEASURES FOR DIVERSIFIED RETRIEVAL

To define what constitutes a good two-level dynamic rank- 
ing, we start by defining the measure of retrieval perfor- 
mance we would like to optimize. We then design our re- 
trieval algorithms and learning methods to maximize this 
measure. In the following, we first consider evaluation mea- 
sures for one-level rankings, and then generalize them to the 
two-level case.

4.1 Measures for Static Rankings 

Existing performance measures range from those that do
not explicitly consider multiple intents (e.g., NDCG, Avg
Prec), to measures that reward diversity. Measures that
reward diversity give lower marginal utility to a document
if the intents the document is relevant to are already well
represented in the ranking. We call this the diminishing
returns property. The extreme case is the "intent coverage"
measure. It attributes utility only to
the first document relevant for an intent and no additional
utility to any additional documents.

We now define a family of measures that includes a whole
range of diminishing returns models, and that includes most
existing retrieval measures. Let $g : \mathbb{R} \to \mathbb{R}$
with $g(0) = 0$ be a concave, non-negative, and non-decreasing
function that models the diminishing returns. We then define the
utility of the ranking $\theta = (d_1, d_2, \ldots, d_k)$
conditioned on a given intent $t$ as

$$U_g(\theta|t) = g\left(\sum_{i=1}^{|\theta|} \gamma_i U(d_i|t)\right). \qquad (1)$$

The $\gamma_1 \geq \gamma_2 \geq \ldots \geq \gamma_k \geq 0$
are discount factors that decrease with position in the ranking,
and $U(d|t)$ is the relevance rating of document $d$ for intent
$t$. The above definition of utility is with respect to a single
user intent $t$. For a distribution of user intents $P[t|q]$ for
query $q$, we take the expectation

$$U_g(\theta|q) = \sum_{t \in \mathcal{T}(q)} P[t|q]\, U_g(\theta|t). \qquad (2)$$


Note that many existing retrieval measures are special
cases of the definition above. For example, if one chooses $g$ to
be the identity function, one recovers the intent-aware
measures proposed in [2] and the modular measures defined in
[3]. Further restricting $P[t|q]$ to put all probability mass on
a single intent leads to conventional measures like DCG [10]
for appropriately chosen $\gamma_i$. At the other extreme,
choosing $g(x) = \min(x, 1)$ leads to the intent coverage
measure [16, 19] that assigns utility only to the first relevant
document for an intent. Beyond these special cases, $g$ can be
chosen from a large class of functions, implying that this family
of performance measures covers a wide range of diminishing
returns models.
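
As a hedged illustration of these special cases, the sketch below evaluates the expected static utility of Equations (1) and (2) for two choices of $g$. The function name `static_utility`, the toy documents, and the intent probabilities are assumptions made for this example only.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple


def static_utility(
    ranking: Sequence[str],
    relevance: Dict[Tuple[str, str], float],
    intent_prob: Dict[str, float],
    g: Callable[[float], float],
    discounts: Optional[Sequence[float]] = None,
) -> float:
    """Expected utility: sum_t P[t|q] * g(sum_i gamma_i * U(d_i|t))."""
    if discounts is None:
        discounts = [1.0] * len(ranking)  # all gamma_i = 1
    total = 0.0
    for intent, prob in intent_prob.items():
        inner = sum(
            gamma * relevance.get((doc, intent), 0.0)
            for gamma, doc in zip(discounts, ranking)
        )
        total += prob * g(inner)
    return total


# Toy data (hypothetical): x1, x2 serve intent "a"; y1 serves intent "b".
relevance = {("x1", "a"): 1.0, ("x2", "a"): 1.0, ("y1", "b"): 1.0}
intent_prob = {"a": 0.5, "b": 0.5}

identity = lambda x: x            # modular, intent-aware precision style
coverage = lambda x: min(x, 1.0)  # intent coverage

deep = ["x1", "x2"]     # two results for one intent
diverse = ["x1", "y1"]  # one result for each intent

deep_identity = static_utility(deep, relevance, intent_prob, identity)
diverse_identity = static_utility(diverse, relevance, intent_prob, identity)
deep_coverage = static_utility(deep, relevance, intent_prob, coverage)
diverse_coverage = static_utility(diverse, relevance, intent_prob, coverage)
```

With the identity function both rankings score the same, whereas the coverage function gives no credit for the second result on an already covered intent and therefore prefers the diverse ranking.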

4.2 Measures for Dynamic Rankings 

We are now ready to extend our family of performance 
measures to dynamic rankings. The key change for dynamic 
rankings is that users interactively adapt which results they 
view. 

How users expand first-level results was defined in
Section 3 as $\pi_d$. Under $\pi_d$, it is natural to define the
utility of a dynamic ranking $\Theta$ as follows. For a user
intent $t$ and a concave, non-negative, and non-decreasing
function $g$,

$$U_g(\Theta|t) = g\left(\sum_{i=1}^{|\Theta|}\left(\gamma_i U(d_{i0}|t) + \sum_{j \geq 1} \gamma_{ij}\, U(d_{i0}|t)\, U(d_{ij}|t)\right)\right). \qquad (3)$$

Like for static rankings, $\gamma_1 \geq \gamma_2 \geq \ldots$
and $\gamma_{i1} \geq \gamma_{i2} \geq \ldots$
are position-dependent discount factors. Furthermore, we
again take the expectation as in Equation (2) to average over
multiple user intents to obtain $U_g(\Theta|q)$.

Note that the utility of a second-level ranking for a given
intent is zero unless the head document in the first-level
ranking has non-zero relevance for that intent. This
encourages second-level rankings to only contain documents
relevant to the same intents as the head document, thus
providing depth. The first-level ranking, on the other hand,
provides diversity as controlled through the choice of function
$g$. The "steeper" $g$ diminishes returns of additional relevant
documents, the more diverse the first-level ranking gets. We
will explore this in more detail in Section 7.1.
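
The gating effect of the head document can be seen in a short sketch of the per-intent dynamic utility. This is an illustration under stated assumptions: all discount factors are set to 1, the tail contribution is taken as the product of head and tail relevance, and the function name and toy relevance values are hypothetical.

```python
from typing import Callable, Dict, Sequence


def dynamic_utility_for_intent(
    rows: Sequence[Sequence[str]],
    rel: Dict[str, float],
    g: Callable[[float], float],
) -> float:
    """Per-intent utility of a two-level ranking with all discounts = 1.

    rel maps a document to its relevance U(d|t) for the fixed intent t.
    A tail document only contributes when its head document is relevant.
    """
    total = 0.0
    for row in rows:
        head, tail = row[0], row[1:]
        u_head = rel.get(head, 0.0)
        total += u_head + sum(u_head * rel.get(d, 0.0) for d in tail)
    return g(total)


rows = [("d7", "d8", "d9"), ("d1", "d2", "d3")]
identity = lambda x: x

# Head d7 is relevant, so its relevant tail documents count as well.
u_expanded = dynamic_utility_for_intent(
    rows, {"d7": 1.0, "d8": 1.0, "d9": 1.0}, identity
)

# Tail document d8 is relevant but its head d7 is not: no utility.
u_hidden = dynamic_utility_for_intent(rows, {"d8": 1.0}, identity)
```

The second call returns zero because, under the assumed user policy, a user never expands a row whose head document is not relevant to the intent.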

A key advantage of modeling utility in the above form is 
that it allows for an efficient algorithm for finding a dynamic 
ranking which maximizes the utility. We present this in the 
next section. 



5. COMPUTING DYNAMIC RANKINGS 

In this section, we provide a greedy algorithm to com- 
pute dynamic rankings. These rankings are computed by 
maximizing the performance measures defined in the pre- 
vious section. Computing rankings to exactly maximize 
our performance measure is an NP-hard problem. However,
we show that our two-level greedy algorithm admits
a $(1 - e^{-(1-1/e)})$-approximation guarantee.

Our proposed greedy algorithm is given in Algorithm 1.
We are given a query $q$, a candidate set of documents
$\mathcal{D}(q)$, intents $\mathcal{T}(q)$ with their
distribution $P[t|q]$, and a concave, non-negative and
non-decreasing function $g$ that defines the utility in (2).
Our goal is to construct a two-dimensional ranking of length
$L$ and width $W$ to maximize the performance measure (2).
In the algorithm, the "overloaded operator" $\oplus$ denotes
either adding a document to a row, or adding a row to an
existing ranking.

The proposed algorithm works as follows. Every docu- 
ment in the remaining collection is considered as the head 
document of a candidate row. For each candidate row, W 
documents are greedily added to maximize the utility of 
the resulting partial dynamic ranking. Once rows of length 
W are constructed, the row which maximizes the utility is 
added to the ranking. The above steps are repeated until 
the ranking has L rows. 

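Algorithm 1 for computing a two-level dynamic ranking.

Input: $(q, \mathcal{D}(q), \mathcal{T}(q), P[t|q] : t \in \mathcal{T}(q))$, $g(\cdot)$, length $L$ and width $W$.
Output: A dynamic ranking $\Theta$.

  $\Theta \leftarrow$ new_two_level()
  while $|\Theta| < L$ do
    bestU $\leftarrow -\infty$
    for all $d \in \mathcal{D}(q)$ s.t. $d \notin \Theta$ do
      row $\leftarrow$ new_row(); row.head $\leftarrow d$
      for $j = 1$ to $W$ do
        bestDoc $\leftarrow \operatorname{argmax}_{d' \notin \Theta \oplus row} U_g(\Theta \oplus (row \oplus d') | q)$
        row $\leftarrow$ row $\oplus$ bestDoc
      end for
      if $U_g(\Theta \oplus row | q) >$ bestU then
        bestU $\leftarrow U_g(\Theta \oplus row | q)$; bestRow $\leftarrow$ row
      end if
    end for
    $\Theta \leftarrow \Theta \oplus$ bestRow
  end while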
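
The nested greedy procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes all discount factors equal 1, uses a plain (non-lazy) evaluation of the utility, and the names `expected_utility`, `greedy_two_level`, and the toy data are hypothetical.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Row = List[str]


def expected_utility(
    rows: Sequence[Row],
    relevance: Dict[Tuple[str, str], float],
    intent_prob: Dict[str, float],
    g: Callable[[float], float],
) -> float:
    """Expected two-level utility with all discount factors fixed to 1."""
    total = 0.0
    for intent, prob in intent_prob.items():
        inner = 0.0
        for row in rows:
            u_head = relevance.get((row[0], intent), 0.0)
            inner += u_head + sum(
                u_head * relevance.get((d, intent), 0.0) for d in row[1:]
            )
        total += prob * g(inner)
    return total


def greedy_two_level(
    docs: Sequence[str],
    relevance: Dict[Tuple[str, str], float],
    intent_prob: Dict[str, float],
    g: Callable[[float], float],
    length: int,
    width: int,
) -> List[Row]:
    """Build up to `length` rows, each with a head and `width` tail docs."""
    ranking: List[Row] = []
    used = set()
    while len(ranking) < length:
        best_u, best_row = -math.inf, None  # type: float, Optional[Row]
        for head in docs:
            if head in used:
                continue
            row = [head]
            # Inner greedy step: fill the row one tail document at a time.
            for _ in range(width):
                candidates = [d for d in docs if d not in used and d not in row]
                if not candidates:
                    break
                best_doc = max(
                    candidates,
                    key=lambda d: expected_utility(
                        ranking + [row + [d]], relevance, intent_prob, g
                    ),
                )
                row.append(best_doc)
            # Outer greedy step: keep the candidate row with highest utility.
            u = expected_utility(ranking + [row], relevance, intent_prob, g)
            if u > best_u:
                best_u, best_row = u, row
        if best_row is None:  # no unused documents left
            break
        ranking.append(best_row)
        used.update(best_row)
    return ranking


# Toy query (hypothetical): three documents for each of two intents.
docs = ["a1", "a2", "a3", "b1", "b2", "b3"]
relevance = {(d, "A"): 1.0 for d in docs[:3]}
relevance.update({(d, "B"): 1.0 for d in docs[3:]})
intent_prob = {"A": 0.6, "B": 0.4}

result = greedy_two_level(docs, relevance, intent_prob, math.sqrt, 2, 2)
```

On this toy input the first row groups the documents of the more likely intent and the second row groups those of the other intent, so both intents are covered at the first level and each receives depth at the second level.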



The proposed algorithm is extremely simple and efficient.
The algorithm requires $O(|\mathcal{T}|)$ space and
$O(|\mathcal{T}||\mathcal{D}|^2)$ time, where $|\mathcal{T}|$
is the total number of intents and $|\mathcal{D}|$ is the number
of candidate documents. The run time of the algorithm
can be further improved using techniques such as lazy
evaluation [14].

We now derive an approximation bound for the greedy 
algorithm by relating it to the well-known problem of opti- 
mizing submodular set functions. First, recall the following 
definition of a submodular function. 

Definition 1. Given a set $U$, a function
$f : 2^U \to \mathbb{R}$ is said to be submodular iff for all
elements $u \in U$ and all sets $X$ and $Y$, s.t.
$X \subseteq Y \subseteq U$, we have

$$f(X \cup \{u\}) - f(X) \geq f(Y \cup \{u\}) - f(Y). \qquad (4)$$

When a submodular function is monotonic (i.e.,
$f(Y) \leq f(X)$, whenever $Y \subseteq X$) and normalized
(i.e., $f(\emptyset) = 0$), greedily constructing a set of
size $k$ gives a $(1 - 1/e)$ approximation [15] to the optimal.
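
For small ground sets, Definition 1 can be checked by brute force. The sketch below is illustrative only; the helper name `is_submodular` and the two test functions are assumptions. It confirms that a concave function of the set size satisfies inequality (4), while a convex one does not.

```python
import math
from itertools import combinations
from typing import Callable, FrozenSet, Iterable


def is_submodular(f: Callable[[FrozenSet[str]], float], ground: Iterable[str]) -> bool:
    """Check f(X + u) - f(X) >= f(Y + u) - f(Y) for all X ⊆ Y ⊆ ground, u."""
    items = list(ground)
    subsets = [
        frozenset(c)
        for r in range(len(items) + 1)
        for c in combinations(items, r)
    ]
    for y in subsets:
        for x in subsets:
            if not x <= y:  # only pairs with X a subset of Y
                continue
            for u in items:
                gain_x = f(x | {u}) - f(x)
                gain_y = f(y | {u}) - f(y)
                if gain_x < gain_y - 1e-12:  # tolerance for float error
                    return False
    return True


ground = ["a", "b", "c", "d"]

# Concave function of a modular function: diminishing returns hold.
concave_ok = is_submodular(lambda s: math.sqrt(len(s)), ground)

# Convex function of the set size: marginal gains increase.
convex_ok = is_submodular(lambda s: float(len(s)) ** 2, ground)
```

The check enumerates all pairs of subsets, so it is only practical for very small ground sets; it is meant as a sanity check, not as part of the ranking algorithm.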

Since the definition of our utility in (3) involves a
concave function, it is not hard to show that finding the next
best row to add (outer step) is a submodular maximization
problem. Moreover, given the head document, finding the
best row (inner step) is also a submodular maximization
problem. Thus, finding a dynamic ranking to maximize our
utility is a nested submodular maximization problem. Since
submodular function maximization is a hard problem, we
can only find an approximately good row (rather than the
best greedy row) to add in each step. In spite of this, we can
show an approximation guarantee for the greedy two-level
ranking algorithm. Our result generalizes submodular
function maximization from one level to two levels in the same
spirit as Hochbaum and Pathria [9] generalize the coverage
problem from one level to two levels.

Lemma 1. The nested greedy algorithm for the nested
submodular optimization problem has a $(1 - e^{-(1-1/e)})$
approximation bound.

Proof. The submodular function in question is denoted
$f$. We have $f$ normalized since the utility of the empty
ranking is 0. Further, $f$ is monotonic since the score can
only increase on adding more rows.

Let $S_i$ be the solution of the method after $i$ iterations of
the outer step of the greedy algorithm. Let $OPT$ be the
optimal solution to the problem with $k$ elements. First, we
define

$$\delta_i = f(S_i) - f(S_{i-1}). \qquad (5)$$

By monotonicity of the function $f$, we have:

$$\forall i,\ f(OPT) \leq f(S_i \cup OPT). \qquad (6)$$

Since at every step we greedily pick the best element, by
submodularity we get:

$$f(S_i \cup OPT) \leq f(S_i) + k\,\delta_{i+1}. \qquad (7)$$

The above inequality follows from the fact that adding the
elements of $OPT$ to the current solution has no more
benefit than $k$ times the benefit achieved by the current best
element.

In the problem that we are considering, finding the best
element (i.e., the inner step to get $S_i$ from $S_{i-1}$) is
submodular as well. Hence, we are not assured of finding the best
element to add to $S_{i-1}$; we can only obtain a
$\beta$-approximate solution (where $\beta = 1 - \frac{1}{e}$).
Thus we have $\delta_i \geq \beta \times \delta_i^{\max}$, where
$\delta_i^{\max}$ denotes the gain of the best element. In
this case the inequality becomes:

$$f(S_i \cup OPT) \leq f(S_i) + k\,\delta_{i+1}/\beta,$$

which along with (6) gives

$$\delta_{i+1} \geq \beta\,(f(OPT) - f(S_i))/k.$$

The above inequality in conjunction with (5) implies

$$f(S_{i+1}) \geq f(S_i) + \beta\,(f(OPT) - f(S_i))/k = \left(1 - \frac{\beta}{k}\right) f(S_i) + \frac{\beta}{k} f(OPT). \qquad (8)$$

Using the above inequality, we can show by induction that
$f(S_i) \geq (1 - (1 - \frac{\beta}{k})^i) f(OPT)$. The base
case with $i = 1$ can be easily shown. For the induction step,
using the inequality (8) and the induction hypothesis we get:

$$f(S_{i+1}) \geq \left(1 - \frac{\beta}{k}\right)\left(1 - \left(1 - \frac{\beta}{k}\right)^i\right) f(OPT) + \frac{\beta}{k} f(OPT) = \left(1 - \left(1 - \frac{\beta}{k}\right)^{i+1}\right) f(OPT).$$

Thus, after $k$ steps, the final solution $S$ satisfies
$f(S) \geq (1 - (1 - \frac{\beta}{k})^k) f(OPT)$, which implies
$f(S) \geq (1 - e^{-\beta}) f(OPT)$.
We get the required result by substituting the value of $\beta$. □
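
As a quick numerical sanity check of the last step (an illustration added here, with hypothetical variable names), the finite-$k$ factor $1 - (1 - \beta/k)^k$ never falls below its limit $1 - e^{-\beta}$, and for $\beta = 1 - 1/e$ that limit is roughly 0.47:

```python
import math

beta = 1.0 - 1.0 / math.e  # approximation factor of the inner greedy step


def greedy_factor(k: int, beta: float) -> float:
    """Guarantee after k outer steps: 1 - (1 - beta/k)^k."""
    return 1.0 - (1.0 - beta / k) ** k


limit = 1.0 - math.exp(-beta)  # value as k grows without bound

factors = {k: greedy_factor(k, beta) for k in (1, 2, 5, 10, 100)}
for k, value in factors.items():
    print(f"k={k:3d}  factor={value:.4f}  limit={limit:.4f}")
```

The inequality follows from $1 - x \leq e^{-x}$, so the bound of Lemma 1 is the worst case over all ranking lengths.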

6. LEARNING DYNAMIC RANKINGS 

In the previous section, we showed that a dynamic rank- 
ing can be efficiently computed when all the intents and rel- 
evance judgments for a given query are known. Of course, in 
practice these are not available. In this section, we therefore 
propose a learning algorithm that can predict dynamic
rankings on previously unseen queries. Following the approach
of Yue and Joachims [22], our algorithm makes use of
word-level features to discriminatively learn a model of the
intent distribution for new queries. In particular, given a
training set of documents with known intents, our algorithm
learns the weight vector of a linear discriminant function. This
discriminant function can then be used in Algorithm 1 as a
substitute for intents and relevance judgments in order to
predict dynamic rankings on unseen queries.

We now describe our learning approach, which is based on
structural SVMs [20]. Our goal here is to learn a mapping
from a query $q$ to a dynamic ranking $\Theta$. We pose this as
the problem of learning a weight vector $w \in \mathbb{R}^N$
from which we can make a prediction as follows:

$$h_w(q) = \operatorname{argmax}_{\Theta}\ w^\top \Psi(q, \Theta). \qquad (9)$$

In the above equation, $\Psi(q, \Theta) \in \mathbb{R}^N$ is a
joint feature-map between the candidate set of documents
$\mathcal{D}(q)$ and $\Theta$ given a query $q$.^1

In the structural SVM framework, given a set of training
examples $(q^i, \Theta^i)_{i=1}^n$, a discriminant function is
obtained by minimizing the empirical risk
$\frac{1}{n}\sum_{i=1}^{n} \Delta(\Theta^i, h_w(q^i))$, where
$\Delta$ is a loss function.

The above expression requires the knowledge of $\Theta^i$ in
order to compute the empirical risk. However, in practice, we are
not given $\Theta^i$ with the training documents. Assuming that
we are given
$(q^i, \mathcal{D}(q^i), \mathcal{T}(q^i), P[t|q^i] : t \in \mathcal{T}(q^i))_{i=1}^n$,
we first compute a dynamic ranking $\Theta^i$ that
(approximately) maximizes the utility $U_g$ using Algorithm 1.
These $\Theta^i$ will be treated as the training examples in the
rest of this section.

A key aspect of structural SVMs is to appropriately de- 
fine the joint-feature map for the problem at hand. For our 
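problem, the joint-feature map in (9) is defined such that:

$$w^\top \Psi(q, \Theta) := \sum_{v \in V_{\mathcal{D}(q)}} w_v^\top \phi_v\, U_g(\Theta|v) + \sum_{s \in V_{\mathcal{D}(q) \times \mathcal{D}(q)}} w_s^\top \phi_s(\Theta), \qquad (10)$$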

where $V_{\mathcal{D}(q)}$ denotes an index set over the words
in the candidate set $\mathcal{D}(q)$. The vector $\phi_v$
denotes word-level features (for example, how often a word occurs
in a document) for the word corresponding to index $v$. The
utility $U_g(\Theta|v)$ is analogous to (3) but is now over the
words in the vocabulary (rather than over intents). In
particular, a document provides utility $U(d|v)$ for a word $v$,
if that word occurs in the document. The word-level features are
reminiscent of the features used in diverse subset prediction
[22]. The key assumption is that the words in a document are
correlated with the intent. This seems natural since documents
relevant to the same intent are likely to share more words than
documents that are relevant to different intents.

^1 Strictly, the joint feature-map should be
$\Psi(\mathcal{D}(q), \Theta|q)$ for a given query $q$. For
brevity, we simply denote this as $\Psi(q, \Theta)$.

The second term in Eq. (10) captures the similarity between
head and tail documents. In this case,
$V_{\mathcal{D}(q) \times \mathcal{D}(q)}$ denotes an index set
over all document pairs in $\mathcal{D}(q)$. Consider
a particular index $s$ that corresponds to documents $d_1$ and
$d_2$ in $\mathcal{D}(q)$. $\phi_s(\Theta)$ is a feature vector
describing the similarity between $d_1$ and $d_2$ in $\Theta$
when $d_1$ is a head document in $\Theta$ and $d_2$ occurs in the
same row as $d_1$. If either $d_1$ is not a head document in
$\Theta$ or $d_2$ is not in the same row as $d_1$,
$\phi_s(\Theta)$ is simply a vector of zeros. An example of a
feature in $\phi_s(\Theta)$ that captures the similarity between
two documents is their TFIDF cosine. In the second term, the
diminishing returns property does not hold strictly. However, it
is easy to see that this term is modular (i.e., Equation (4)
holds with equality) and hence our greedy algorithm and its
guarantee still hold even with these similarity features.

From the above definition of the feature-map (10), it is
clear that $w^\top \Psi(q, \Theta)$ models the utility of a
given dynamic ranking $\Theta$. Thus, a good discriminant
function must give a higher value to rankings with higher
utility. This is achieved by solving the following structural
SVM optimization problem for $w$:
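
For orientation, a generic margin-rescaling structural SVM program of the kind introduced in [20] is sketched below. The n-slack form, the slack variables $\xi_i$, and the $C/n$ scaling are assumptions taken from the standard formulation and may differ in detail from problem (11).

```latex
\begin{aligned}
\min_{w,\ \xi \geq 0} \quad & \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & w^\top \Psi(q^i, \Theta^i) - w^\top \Psi(q^i, \Theta)
  \;\geq\; \Delta(\Theta^i, \Theta \,|\, q^i) - \xi_i
  \qquad \forall i,\ \forall \Theta \neq \Theta^i
\end{aligned}
```

In this generic form each constraint asks the training ranking $\Theta^i$ to score higher than any competing ranking $\Theta$ by a margin that grows with the loss, with $\xi_i$ absorbing violations.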
The constraints in the above formulation ensure that the
predicted utility for the dynamic ranking $\Theta^i$ is higher
than the predicted utility for any other $\Theta$. The objective
function in (11) minimizes the empirical risk while maximizing
the margin. The risk and the margin are traded off by the scalar
parameter $C > 0$. The loss between $\Theta^i$ and $\Theta$ is
given by:

The above definition ensures that the loss is zero when
$\Theta = \Theta^i$. It is easy to see that a dynamic ranking
$\Theta$ has a large loss when its utility is low compared to
the utility of $\Theta^i$.

The quadratic program in Eq. (11) is convex and it can be
solved efficiently using a cutting-plane algorithm [20].
Even though Eq. (11) has an exponential number of
constraints, the cutting-plane algorithm can be shown to always
terminate in polynomial time [11, 20]. In each iteration of
the cutting-plane algorithm, the most violated constraints in
(11) are added to a working set and the resulting quadratic
program is solved. Given a current $w$, the most violated
constraints are obtained by solving:

$$\operatorname{argmax}_{\Theta}\ w^\top \Psi(q^i, \Theta) + \Delta(\Theta^i, \Theta | q^i). \qquad (12)$$

It is easy to see that Algorithm 1 can be used to solve this
problem, even though the formal approximation guarantee
does not hold in this case. While the original structural SVM
was proposed for exact inference in Eq. (12), the approach
has been shown effective [22] even when only approximate
inference is possible. Once a weight vector $w$ is obtained,
the dynamic ranking for a test query can be obtained from
Eq. (9).

Statistic                                   TREC     WEB
No. of queries                              17       28
Avg. # of documents per query               46.3     76.1
Avg. # of intents per query                 20.8     4.5
Avg. # of docs with > 1 intent per query    9.6      25.6
Frac. of docs with > 1 intent per query     0.21     0.34
Avg. # of intents per document              1.33     1.41
Frac. of docs on prevalent intent           0.376    0.734

Table 2: Key statistics of the two datasets.

7. EXPERIMENTS 

This section explores the properties of our two-level rank- 
ing method empirically. In particular, we first investigate 
how the choice of concave function g impacts diversity and 
depth. We also compare against several static and dynamic
baselines, and finally evaluate how accurately two-level
dynamic rankings can be learned using the Structural SVM
method. 

All experiments were conducted on two datasets, namely the
TREC 6-8 Interactive Track (TREC) and the Diversity Track of
TREC 18 using the ClueWeb collection (WEB). Each query in
TREC contains between 7 and 56 different manually judged
intents. In the case of WEB, we used 28 queries with 4 or
more intents. Unless noted otherwise, we consider the
probability P[t] of an intent to be proportional to the
number of documents relevant to that intent. Key statistics
describing the two datasets are provided in Table 2. Note
that the two datasets differ vastly on some criteria and
therefore span a wide range of application scenarios. In
particular, the most prevalent intent covers 73.4% of the
documents in the WEB dataset, while the most prevalent
intent in TREC is far less dominant with 37.6%.
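
As a minimal sketch of the intent prior described above, the snippet below derives P[t] from per-document relevance judgments by counting, for each intent, the documents judged relevant to it and normalizing. The function name `intent_probabilities` and the toy judgments are hypothetical and only illustrate the proportionality rule; they are not taken from either dataset.

```python
from collections import Counter

def intent_probabilities(relevance):
    """relevance maps a document id to the set of intents it is relevant to.

    Returns P[t], proportional to the number of documents relevant to t.
    """
    counts = Counter(t for intents in relevance.values() for t in intents)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

# Hypothetical toy judgments for one query.
relevance = {"d1": {"A"}, "d2": {"A", "B"}, "d3": {"A"}, "d4": {"C"}}
P = intent_probabilities(relevance)
print(P)
```

Note that a document relevant to several intents contributes to each of them, so the normalizer is the total number of (document, intent) pairs rather than the number of documents.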

Unless noted otherwise, the number of documents in the
first-level ranking is set to 5. The width of the
second-level rankings is set to 2 (i.e., one head document
plus 2 second-level results). For simplicity, we chose all
factors $\gamma_i$ and $\gamma_{ij}$ in Equations (1) and (3)
to be 1. Further, we chose U(d|t) = 1 if document d was
relevant to intent t and set U(d|t) = 0 otherwise.

7.1 Controlling Diversity and Depth 

The key design choice of our family of utility measures
is the concave function g. Since Algorithm 1 directly
optimizes utility, we first explore how different choices of
g affect various properties of the two-level rankings
produced by our method.

We experiment with four different concave functions g,
each providing a different diminishing-returns model. At one
extreme, we have the identity function g(x) = x, which
corresponds to modular returns (i.e. Eq. (Q holds with
equality). Using this function in Eq. (1) leads to the
intent-aware Precision measure proposed in [2], and it is the
only function considered in [3]. We therefore refer to this
function as PREC. It is not hard to show that Algorithm 1
actually computes the optimal two-level ranking for this
choice of g. On the other end of the spectrum, we study
g(x) = min(x, 2).
Figure 2: Illustrating the diminishing-returns properties of
four concave functions.
Eval. \ Optim.    PREC     SQRT     LOG      SAT2
PREC              0.315    0.302    0.294    0.164
SQRT              1.612    1.664    1.659    1.333
LOG               1.216    1.267    1.27     1.046
SAT2              1.18     1.335    1.349    1.487

Table 3: Performance when optimizing and evaluating using
different performance measures for TREC.



Eval. \ Optim.    PREC     SQRT     LOG      SAT2
PREC              0.746    0.731    0.714    0.443
SQRT              3.083    3.132    3.118    2.472
LOG               2.236    2.297    2.303    1.908
SAT2              1.773    1.882    1.892    1.984

Table 4: Same as Table 3 for WEB.



Figure 3: Average number of intents covered (left) and
average number of documents for prevalent intent (right) in
the first-level ranking.



By remaining constant after two, this function discourages
presenting more than two relevant documents for any intent.
The measure obtained using this function in Eq. (3) will be
referred to as SAT2 (short for "satisfied after two").
In between these two extremes, we study the square root
function (SQRT) $g(x) = \sqrt{x}$ and the log function (LOG)
$g(x) = \log(1 + x)$. A plot of all four functions is shown
in Figure 2.
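
The diminishing-returns behaviour of the four functions can be made concrete with a short sketch that prints the marginal gain g(x+1) - g(x) of each additional relevant document. The dictionary `G` and the helper `marginal_gain` are illustrative names, and the natural logarithm is an assumption, since the base of the log is not stated above.

```python
import math

# The four concave functions discussed in this section.
# Assumption: LOG uses the natural logarithm.
G = {
    "PREC": lambda x: x,
    "SQRT": lambda x: math.sqrt(x),
    "LOG": lambda x: math.log(1 + x),
    "SAT2": lambda x: min(x, 2),
}

def marginal_gain(g, x):
    """Extra utility of one more relevant document after x relevant ones."""
    return g(x + 1) - g(x)

for name, g in G.items():
    gains = [round(marginal_gain(g, x), 3) for x in range(4)]
    print(name, gains)
```

PREC keeps a constant gain, SAT2 drops to zero after the second relevant document, and SQRT and LOG decay gradually in between.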

To explore how dynamic rankings differ for different choices
of g, we used Algorithm 1 to compute the two-level rankings
(approximately) maximizing the respective measure for known
relevance judgments U(d|t) and P[t]. Figure 3 shows how g
influences diversity. The left-hand plot shows how many
different intents are represented in the top 5 results of
the first-level ranking on average. The graph shows that the
stronger the diminishing-returns model, the more different
intents are covered in the first-level ranking. In
particular, the number of intents almost doubles on both
datasets when moving from PREC to SAT2. In return, the
number of documents on the most prevalent intent in the
first-level ranking decreases, as shown in the right-hand
plot. This illustrates how the choice of g can be used to
control the desired amount of diversity in the first-level
ranking.
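
The effect of g on diversity can be reproduced on toy data with a plain greedy selection. The sketch below is not the paper's Algorithm 1: it only builds a static top-5 list (no second-level rankings), assumes binary relevance and all discount factors equal to 1, and uses hypothetical documents and intent probabilities.

```python
def expected_utility(ranking, relevance, P, g):
    """Sum over intents t of P[t] * g(number of documents relevant to t)."""
    return sum(
        p * g(sum(1 for d in ranking if t in relevance[d]))
        for t, p in P.items()
    )

def greedy_ranking(relevance, P, g, k=5):
    """Repeatedly add the document with the largest utility after adding it."""
    ranking = []
    candidates = sorted(relevance)  # fixed order for deterministic ties
    while len(ranking) < k and candidates:
        best = max(
            candidates,
            key=lambda d: expected_utility(ranking + [d], relevance, P, g),
        )
        ranking.append(best)
        candidates.remove(best)
    return ranking

def intents_covered(ranking, relevance):
    return len(set().union(*(relevance[d] for d in ranking)))

# Hypothetical candidate set: six documents on intent A, two each on B and C.
relevance = {f"a{i}": {"A"} for i in range(1, 7)}
relevance.update({"b1": {"B"}, "b2": {"B"}, "c1": {"C"}, "c2": {"C"}})
P = {"A": 0.6, "B": 0.2, "C": 0.2}

prec_top5 = greedy_ranking(relevance, P, lambda x: x)
sat2_top5 = greedy_ranking(relevance, P, lambda x: min(x, 2))
print(intents_covered(prec_top5, relevance), intents_covered(sat2_top5, relevance))
```

On this toy input the identity function fills the list with documents on the prevalent intent, while SAT2 stops rewarding that intent after two documents and spends the remaining slots on the other intents.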

Tables 3 (TREC) and 4 (WEB) provide further insight into
the impact of g, now also including the contributions of the
second-level rankings. The rows correspond to different
choices for g when evaluating expected utility according to
Eq. (3), while the columns show which g the two-level
ranking was optimized for. Not surprisingly, the diagonal
entries of Tables 3 and 4 show that the best performance for
each measure is obtained when optimizing for it. The
off-diagonal entries show that using a different g during
optimization leads to substantially different rankings. This
is particularly apparent when optimizing the two extreme
performance measures PREC and SAT2; optimizing one
invariably leads to rankings that have a low value of the
other. In contrast, optimizing LOG or SQRT results in much
smoother behavior across all measures, and both seem to
provide a good compromise between depth (for the prevalent
intent) and diversity.

7.2 Static vs. Dynamic Ranking 

The ability to simultaneously provide depth and diversity
was a key motivation for our dynamic ranking approach over
conventional static rankings. We now evaluate whether this
goal is indeed achieved. We compare the two-level rankings
produced by Algorithm 1 (denoted Dyn) with several static
baselines.

First, we compare against a diversity-only static ranking,
namely the static ranking obtained by maximizing intent
coverage using the set-coverage algorithm proposed in [22]
(denoted Stat-Div). Second, we compare against a depth-only
static ranking, namely the static ranking that optimizes
utility with g chosen to be the identity function (denoted
Stat-Depth). Note that Algorithm 1 can be used for this
purpose by setting the width of the second-level rankings to
0. Third, we similarly use Algorithm 1 to produce static
rankings that optimize SQRT, LOG, and SAT2 (denoted
Stat-Util). Note that both Dyn and Stat-Util optimize the
same measure that is used for evaluation.

To make a fair comparison between static and dynamic
rankings, we measure performance in the following way. For
static rankings, we compute performance using the
expectation of Eq. (1) at a depth cutoff of 5. In
particular, we measure PREC@5, SQRT@5, LOG@5 and SAT2@5. For
two-level rankings, the number of results viewed by a user
depends on how many results he or she expands. We therefore
truncate any user's path through the two-level ranking after
visiting 5 results and compute PREC@5, SQRT@5, LOG@5 and
SAT2@5 for the truncated path.

Results of these comparisons are shown in Figures 4 and 5.
First, we see that both Dyn and Stat-Util outperform
Stat-Div, illustrating that optimizing rankings for the
desired evaluation measure leads to much better performance
than using a proxy measure as in Stat-Div. Note that Stat-Div
never tries to present more than one result for each intent,
which explains the extremely low "depth" performance in
terms of PREC@5. But Stat-Div is not competitive even for
Figure 4: Comparing the retrieval quality of Static
vs. Dynamic Rankings for TREC.

Figure 5: Same as Figure 4 for WEB.



SAT2, since it never tries to provide a second result for the
more prevalent intents. Second, at first glance it may be
surprising that Dyn outperforms Stat-Depth even on PREC@5,
despite the fact that Stat-Depth explicitly (and globally
optimally) optimizes depth. As an explanation, consider a
situation where A is the prevalent intent, and there are
three documents relevant to A and B and three relevant to A
and C. Putting those sets of three documents into the first
two rows of the dynamic ranking provides better PREC@5 than
sequentially listing them in the optimal static ranking.
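
The A/B/C situation above can be checked numerically. The following sketch assumes a simple user policy that is not spelled out in this section: the user reads head documents top-down and expands a row only if its head document is relevant to his or her intent, and the path is cut off after 5 viewed results. PREC@5 is computed here as the unnormalized expected number of relevant results, with hypothetical intent probabilities proportional to document counts.

```python
def static_views(ranking, intent, cutoff=5):
    """A static ranking shows the same top results to every user."""
    return ranking[:cutoff]

def dynamic_views(rows, relevance, intent, cutoff=5):
    """rows is a list of (head, second_level_docs) pairs.

    Assumed policy: expand a row only if its head is relevant to the intent.
    """
    viewed = []
    for head, second_level in rows:
        if len(viewed) >= cutoff:
            break
        viewed.append(head)
        if intent in relevance[head]:
            viewed.extend(second_level)
    return viewed[:cutoff]

def prec_at_cutoff(views, relevance, P):
    """Expected number of relevant results on the truncated path (g = identity)."""
    return sum(
        p * sum(1 for d in views(t) if t in relevance[d])
        for t, p in P.items()
    )

relevance = {
    "ab1": {"A", "B"}, "ab2": {"A", "B"}, "ab3": {"A", "B"},
    "ac1": {"A", "C"}, "ac2": {"A", "C"}, "ac3": {"A", "C"},
}
P = {"A": 0.5, "B": 0.25, "C": 0.25}
static = ["ab1", "ab2", "ab3", "ac1", "ac2", "ac3"]
rows = [("ab1", ["ab2", "ab3"]), ("ac1", ["ac2", "ac3"])]

static_score = prec_at_cutoff(lambda t: static_views(static, t), relevance, P)
dynamic_score = prec_at_cutoff(lambda t: dynamic_views(rows, relevance, t), relevance, P)
print(static_score, dynamic_score)
```

Under these assumptions the static list scores 3.75 and the two-level ranking 4.0. The difference comes from users with intent C, who skip the first row after seeing its head document and therefore reach all three documents relevant to C within 5 views.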

Overall, Figures 4 and 5 show that the dynamic ranking
method outperforms all static ranking schemes on all the
metrics, in many cases by a substantial margin. This gain is
more pronounced for TREC than for WEB. This can be explained
by the fact that WEB queries are less ambiguous, since the
single most prevalent intent accounts for more than 70% of
all documents on average.

7.3 Width of Second-Level Rankings 

In the previous experiments the width of the second-level
rankings was limited to 2. To study the effect of width, we
varied it from 0 (i.e., single-level, static) to 4. In each
case we obtained a dynamic ranking optimized for the
respective measure from Algorithm 1. We again use the
truncated metrics as defined in Section 7.2 for evaluation.
The results are shown in Figure 6. Performance generally
increases with increasing width on TREC. However, note that
increasing width for SAT2 does not help much beyond width 1,
which is to be expected. The improvements from increased
width are less strong on WEB, where not much gain is
provided beyond width 1. Again, this can be explained by the
lower amount of query ambiguity.

7.4 Learning Two-level Ranking Functions 

So far we have evaluated how well Algorithm 1 can construct
effective two-level rankings if the relevance ratings are
known. We now explore how well our learning algorithm can
predict two-level rankings for previously unseen queries.
Figure 6: Retrieval performance when the width of
the second-level ranking is varied for TREC (left)
and WEB (right).



For all experiments in this section, we learn and predict
using SQRT as the choice for g, since it provides a good
tradeoff between diversity and depth as shown above. We
performed cross-validation as follows and report test-set
performance averaged over all splits.

For TREC, each test set consisted of a single held-out
query. For each remaining set of 16 queries, 4 further splits
were made such that 12 were used for training and 4 were
used for validation. For WEB, we divided the data into 28
splits of 16 training, 8 validation and 4 testing queries.
Queries were split such that each query appeared equally
often in the training, validation and test sets. The C
parameter of the structural SVM was varied from 10 ~^ to
10~^. The C value corresponding to the best performance
on the validation set was picked for each split.
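
The TREC protocol above (one held-out test query, then four train/validation splits of the remaining 16 queries) can be sketched as follows. The helper `trec_style_splits`, the query identifiers, and the use of contiguous validation folds are assumptions made for illustration; how queries were actually assigned to folds is not specified here.

```python
def trec_style_splits(queries, n_val_folds=4):
    """Leave-one-query-out testing with nested train/validation folds."""
    splits = []
    for i, test_query in enumerate(queries):
        rest = queries[:i] + queries[i + 1:]
        fold = len(rest) // n_val_folds
        for j in range(n_val_folds):
            # Assumption: validation folds are contiguous blocks of `rest`.
            val = rest[j * fold:(j + 1) * fold]
            train = rest[:j * fold] + rest[(j + 1) * fold:]
            splits.append({"train": train, "val": val, "test": [test_query]})
    return splits

queries = [f"q{i}" for i in range(17)]  # hypothetical ids for the 17 TREC queries
splits = trec_style_splits(queries)
print(len(splits), len(splits[0]["train"]), len(splits[0]["val"]))
```

For each split, a model would be trained once per candidate C on the training queries, and the C with the best validation performance would then be used for the held-out test query.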

To compute features, we performed standard preprocessing
such as tokenization, stopword removal and Porter stemming.
Since the focus of our work is on diversity and not on
relevance, we rank only those documents that are relevant to
at least one intent of a query. This simulates a candidate
set that may have been provided by a conventional retrieval
method. This setup is similar to that used by Yue and
Joachims [22].

Many of our features in $\Psi$ follow those used in [22].
These features provide information about the importance of a
word in terms of two different aspects. A first type of
feature describes the overall importance of a word. These
features capture, for example, the intuition that a word
appearing in 10 documents in the candidate set is more
important than a word appearing in only one. Examples
include:

• Word appears in at least x% of the documents?

• Word appears in the title of at least x% of the
documents?

A second type of feature describes the importance of a 
word in a particular document. This captures the intuition 
that a word that appears 10 times in a document is more 
important than a word that appears only once. Examples 
include: 

• Word appears with frequency of at least y% within the 
document? 

• Word that appears in x% of the documents, appears 
with frequency of at least y% within the document? 

Finally, we also use features that model the relationship 
between the documents in the second-level ranking and the 
corresponding head document of that row. Examples of this 
type of feature include: 
Figure 7: Performance of learned retrieval functions,
comparing static vs. dynamic rankings for TREC.

Figure 8: Same as Figure 7 for WEB.



• TFIDF similarity of both documents, binned into
multiple binary features,

• Number of common words that appear in both documents
with frequency at least x%.
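
The thresholded features listed above can be sketched as simple binary indicators. The function names and all threshold values below are hypothetical; the actual thresholds x and y and the bin boundaries are not given in this section.

```python
def word_importance_features(word, docs, thresholds=(0.1, 0.25, 0.5)):
    """One bit per threshold x: does the word appear in at least x of the docs?

    docs is a list of token lists forming the candidate set.
    """
    doc_fraction = sum(1 for tokens in docs if word in tokens) / len(docs)
    return [int(doc_fraction >= x) for x in thresholds]

def within_document_features(word, tokens, thresholds=(0.01, 0.05, 0.1)):
    """One bit per threshold y: is the word's relative frequency at least y?"""
    frequency = tokens.count(word) / len(tokens) if tokens else 0.0
    return [int(frequency >= y) for y in thresholds]

def binned_similarity(similarity, bins=(0.2, 0.4, 0.6, 0.8)):
    """Turn a similarity score in [0, 1] into multiple binary features."""
    return [int(similarity >= b) for b in bins]

docs = [
    ["apple", "ipod"],
    ["apple", "pie"],
    ["jaguar", "car"],
    ["apple", "apple", "store"],
]
print(word_importance_features("apple", docs))
print(within_document_features("apple", docs[3], thresholds=(0.5, 0.7)))
print(binned_similarity(0.5))
```

Cumulative indicators of this kind let a linear model assign increasing weight as a word becomes more frequent, without fixing a single cutoff in advance.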

Dynamic vs. Static. 

In the first set of experiments, we compare our learning
method (Dyn-SVM) for two-level rankings with two static
baselines. The first static baseline is the learning method
from [22], which optimizes diversity and is henceforth
referred to as Stat-Div. It is one of the very few learning
methods for diversified retrieval functions, and it was
shown to outperform non-learning methods like Essential
Pages [19]. We also consider a random static baseline
(referred to as Stat-Rand), which randomly orders the
candidate documents. This is a competent baseline, since all
our candidate documents are relevant for at least one intent.

Figure 7 shows the comparison between static and dynamic
rankings for TREC. Dyn-SVM substantially outperforms both
static baselines across all performance metrics, mirroring
the results we obtained in Section 7.2 where the relevance
judgments were known. This shows that our learning method
can effectively generalize the multi-intent relevance
judgments to new queries. On the less ambiguous WEB dataset,
Figure 8 shows again that the differences between static and
dynamic rankings are smaller. While Dyn-SVM substantially
outperforms Stat-Rand, Stat-Div is quite competitive on WEB.

Learning vs. Heuristic Baselines. 

We also compare against alternative methods for
constructing two-level rankings. In particular, we extend
the static baselines Stat-Rand and Stat-Div using the
following heuristic. For each result in the static ranking,
we add a second-level ranking using the documents with the
highest TFIDF similarity from the candidate set. This
results in two dynamic baselines, which we call Dyn-Rand and
Dyn-Div.
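
A minimal sketch of this heuristic is given below. It uses a plain TF-IDF weighting with cosine similarity written from scratch, which may differ from the exact TFIDF variant used in the experiments, and it assumes that only the head document itself is excluded from its own second-level ranking. All names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs maps a document id to its token list; returns sparse TF-IDF dicts."""
    n = len(docs)
    df = Counter(w for tokens in docs.values() for w in set(tokens))
    return {
        d: {w: tf * math.log(n / df[w]) for w, tf in Counter(tokens).items()}
        for d, tokens in docs.items()
    }

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def extend_with_second_level(static_ranking, docs, width=2):
    """Attach to each head document its `width` most similar candidates."""
    vectors = tfidf_vectors(docs)
    rows = []
    for head in static_ranking:
        others = [d for d in docs if d != head]
        others.sort(key=lambda d: cosine(vectors[head], vectors[d]), reverse=True)
        rows.append((head, others[:width]))
    return rows

docs = {
    "d1": ["apple", "ipod", "music"],
    "d2": ["apple", "ipod", "nano"],
    "d3": ["apple", "pie", "recipe"],
    "d4": ["pie", "recipe", "baking"],
}
rows = extend_with_second_level(["d1", "d3"], docs, width=1)
print(rows)
```

Applied to a static ranking from Stat-Rand or Stat-Div, a procedure of this kind yields the row structure of a two-level ranking without any learning.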

The results are shown in Figure 9 for TREC and in
Figure 10 for WEB. Since we are now comparing two-level




Figure 9: Comparing learned dynamic rankings with 
heuristic baselines for TREC. 




Figure 10: Same as Figure 9 for WEB.



rankings of equal size, we measure performance in terms
of expected utility. On both datasets Dyn-SVM performs
substantially better than Dyn-Rand. This implies that our
method can effectively learn which documents to place at
the top of the first-level ranking. Surprisingly, simply
extending the diversified ranking of Dyn-Div using the TFIDF
heuristic produces dynamic rankings that are competitive
with Dyn-SVM. In retrospect, this is not too surprising for
two reasons. First, our experiments with Dyn-SVM use
rather simple features to describe the relationship between
the head document and the documents in the second-level
ranking, most of which are derived from their TFIDF
similarity. Stronger features exploiting document ontologies
or browsing patterns could easily be incorporated into the
feature vector. Second, the learning method of Dyn-Div is
actually a special case of Dyn-SVM when using the SAT1 loss
(i.e., users are satisfied after a single relevant document)
and second-level rankings of width 0. However, we argue that
it is still highly preferable to directly optimize the
desired loss function and two-level ranking using Dyn-SVM,
since the reliance on heuristics may fail on other datasets.

8. CONCLUSIONS AND FUTURE WORK 

We proposed a two-level dynamic ranking approach that
provides both diversity and depth for ambiguous queries by
exploiting user interactivity. In particular, we showed that
the approach has the following desirable properties. First,
it covers a large family of performance measures, making it
easy to select a diminishing-returns model for the
application setting at hand. Second, we presented an
efficient algorithm for constructing two-level rankings that
maximize the given performance measure with provable
approximation guarantees. Finally, we provided a structural
SVM algorithm for learning two-level ranking functions,
showing that it can effectively generalize to new queries.

The idea of dynamic ranking models that allow and actively
anticipate user interactions, as well as the proof that
such models can be learned and implemented with provable
approximation guarantees, opens a wide range of further
questions. First, we need user studies that investigate what
types of user-interaction policies $\pi_d$ are most accurate
in practice. While the two-level model used in this paper
appears to be more plausible than an infinite-level model,
a detailed user study needs to investigate this question.
Second, the learning approach presented in this paper
requires relevance judgments for each intent of a query.
While manually collecting such judgments is feasible in
commercial settings like Web search engines, it would be
desirable to have algorithms that can learn such models from
implicit feedback in settings with resource constraints.
Finally, there is a whole range of additional information
that could be incorporated into the model. For example, a
taxonomy (of words or documents) is likely to provide
valuable features for modeling the dependencies between the
head document and the results in the second-level ranking.